101 research outputs found

    Exploiting Record Similarity for Practical Vertical Federated Learning

    As the privacy of machine learning has drawn increasing attention, federated learning has been introduced to enable collaborative learning without revealing raw data. Notably, vertical federated learning (VFL), where parties share the same set of samples but each hold only a subset of the features, has a wide range of real-world applications. However, existing studies in VFL rarely examine the "record linkage" process. They either design algorithms assuming the data from different parties have already been linked or use simple linkage methods such as exact linkage or top-1 linkage. These approaches are unsuitable for many applications, such as those involving GPS locations or noisy titles, which require fuzzy matching. In this paper, we design a novel similarity-based VFL framework, FedSim, which is suitable for more real-world applications and achieves higher performance on traditional VFL tasks. Moreover, we theoretically analyze the privacy risk caused by sharing similarities. Our experiments on three synthetic datasets and five real-world datasets with various similarity metrics show that FedSim consistently outperforms other state-of-the-art baselines.

    Privacy-Preserving Gradient Boosting Decision Trees

    The Gradient Boosting Decision Tree (GBDT) has become a popular machine learning model for various tasks in recent years. In this paper, we study how to improve the model accuracy of GBDT while preserving the strong guarantee of differential privacy. Sensitivity and privacy budget are two key design aspects for the effectiveness of differentially private models. Existing solutions for GBDT with differential privacy suffer from significant accuracy loss due to overly loose sensitivity bounds and ineffective privacy budget allocations (especially across different trees in the GBDT model). Loose sensitivity bounds require more noise to reach a fixed privacy level. Ineffective privacy budget allocations worsen the accuracy loss, especially when the number of trees is large. Therefore, we propose a new GBDT training algorithm that achieves tighter sensitivity bounds and more effective noise allocations. Specifically, by investigating the property of the gradient and the contribution of each tree in GBDTs, we propose to adaptively control the gradients of training data for each iteration and to clip leaf nodes in order to tighten the sensitivity bounds. Furthermore, we design a novel boosting framework to allocate the privacy budget between trees so that the accuracy loss can be further reduced. Our experiments show that our approach can achieve much better model accuracy than other baselines.

    OEBench: Investigating Open Environment Challenges in Real-World Relational Data Streams

    How to get insights from relational data streams in a timely manner is a hot research topic. This type of data stream can present unique challenges, such as distribution drifts, outliers, emerging classes, and changing features, which have recently been described as open environment challenges for machine learning. While existing studies have addressed incremental learning for data streams, their evaluations are mostly conducted with manually partitioned datasets. Thus, a natural question is what those open environment challenges look like in real-world relational data streams and how existing incremental learning algorithms perform on real datasets. To fill this gap, we develop an Open Environment Benchmark named OEBench to evaluate open environment challenges in relational data streams. Specifically, we investigate 55 real-world relational data streams and establish that open environment scenarios are indeed widespread in real-world datasets, which presents significant challenges for stream learning algorithms. Through benchmarks with existing incremental learning algorithms, we find that increased data quantity may not consistently enhance model accuracy in open environment scenarios, where machine learning models can be significantly compromised by missing values, distribution shifts, or anomalies in real-world data streams. Current techniques are insufficient to mitigate the challenges posed by open environments, and more research is needed to address them. All datasets and code are open-sourced at https://github.com/sjtudyq/OEBench

    Effective and Efficient Federated Tree Learning on Hybrid Data

    Federated learning has emerged as a promising distributed learning paradigm that facilitates collaborative learning among multiple parties without transferring raw data. However, most existing federated learning studies focus on either horizontal or vertical data settings, where the data of different parties are assumed to be from the same feature or sample space. In practice, a common scenario is the hybrid data setting, where data from different parties may differ in both features and samples. To address this, we propose HybridTree, a novel federated learning approach that enables federated tree learning on hybrid data. We observe the existence of consistent split rules in trees. With the help of these split rules, we theoretically show that the knowledge of parties can be incorporated into the lower layers of a tree. Based on our theoretical analysis, we propose a layer-level solution that does not need frequent communication to train a tree. Our experiments demonstrate that HybridTree can achieve accuracy comparable to the centralized setting with low computational and communication overhead. HybridTree can achieve up to an 8-fold speedup compared with the other baselines.

    Simulation of upper tropospheric CO₂ from chemistry and transport models

    The California Institute of Technology/Jet Propulsion Laboratory two-dimensional (2-D), three-dimensional (3-D) GEOS-Chem, and 3-D MOZART-2 chemistry and transport models (CTMs), driven respectively by NCEP2, GEOS-4, and NCEP1 reanalysis data, have been used to simulate upper tropospheric CO₂ from 2000 to 2004. Model results of CO₂ mixing ratios agree well with monthly mean aircraft observations at altitudes between 8 and 13 km (Matsueda et al., 2002) in the tropics. The upper tropospheric CO₂ seasonal cycle phases are well captured by the CTMs. Model results have smaller seasonal cycle amplitudes in the Southern Hemisphere than in the Northern Hemisphere, which is consistent with the aircraft data. Some discrepancies are evident between the model and aircraft data in the midlatitudes, where models tend to underestimate the amplitude of the CO₂ seasonal cycle. Comparison of the simulated vertical profiles of CO₂ between the different models reveals that the convection in the 3-D models is likely too weak in boreal winter and spring. Model sensitivity studies suggest that convection mass flux is important for the correct simulation of upper tropospheric CO₂.

    CO₂ semiannual oscillation in the middle troposphere and at the surface

    Using in situ measurements, we find a semiannual oscillation (SAO) in the midtropospheric and surface CO₂. Chemistry transport models (2-D Caltech/JPL model, 3-D GEOS-Chem, and 3-D MOZART-2) are used to investigate possible sources for the SAO signal in the midtropospheric and surface CO₂. From model sensitivity studies, it is revealed that the SAO signal in the midtropospheric CO₂ originates mainly from surface CO₂ with a small contribution from transport fields. It is also found that the source for the SAO signal in surface CO₂ is mostly related to the CO₂ exchange between the biosphere and the atmosphere. By comparing model CO₂ with in situ CO₂ measurements at the surface, we find that models are able to capture both annual and semiannual cycles well at the surface. Model simulations of the annual and semiannual cycles of CO₂ in the tropical middle troposphere agree reasonably well with aircraft measurements.